ABSTRACT

Data Cleaning is a long standing problem, which is grow- 
ing in importance with the mass of uncurated web data. 
State of the art approaches for handling inconsistent data 
are systems that learn and use conditional functional de- 
pendencies (CFDs) to rectify data. These methods learn 
data patterns (CFDs) from a clean sample of the data and
use them to rectify the dirty/inconsistent data. While get- 
ting a clean training sample is feasible in enterprise data 
scenarios, it is infeasible in web databases where there is 
no separate curated data. CFD based methods are unfor- 
tunately particularly sensitive to noise; we will empirically 
demonstrate that the number of CFDs learned falls quite 
drastically with even a small amount of noise. In order to 
overcome this limitation, we propose a fully probabilistic 
framework for cleaning data. Our approach involves learn- 
ing both the generative and error (corruption) models of the 
data and using them to clean the data. For generative mod- 
els, we learn Bayes networks from the data. For error mod- 
els, we consider a maximum entropy framework for combining
multiple error processes. The generative and error models 
are learned directly from the noisy data. We present the 
details of the framework and demonstrate its effectiveness 
in rectifying web data. 

1. INTRODUCTION 

Real-world data is noisy and often suffers from corrup- 
tions that may impact data understanding, data modeling 
and decision-making. This situation is ubiquitous and even 
more severe when we deal with the web data generated by 
users or automated programs. For example, humans can 
introduce errors like typos and omitted data entries, and au- 
tomated approaches can introduce algorithmic errors such as 
inaccurate information extraction. Alleviating this problem 
needs data cleaning, i.e., catching and fixing corruptions in 
the data. In this paper, we focus on unsupervised cleaning 
for the uncurated structured data on the web rife with in- 
completeness and inconsistency. By identifying and curing 
noisy values, it is possible to gain deeper understanding of 
the data, improve models, or make better decisions. 

A variety of approaches have been proposed for data clean-
ing, from traditional methods (e.g., outlier detection [9],
noise removal [15], and imputation [t]) to recent efforts on
examining integrity constraints, e.g., functional/inclusion
dependencies (FD/IND) [3] and their extensions (CFD/CIND)
[5][2]. Although these methods are efficient in their own
scenarios, they have severe drawbacks when cleaning noisy
web data: (1) State-of-the-art approaches (e.g., [s] [5] [2])
depend on the availability of a clean data corpus or an
external reference table to learn data quality rules/patterns
before fixing the errors. Such clean corpora may be easy
to establish in a tightly controlled enterprise environment
but are infeasible to obtain on the web. One may attempt to
learn data quality rules directly from the noisy web data.
Unfortunately, as we demonstrate in Section 3, this attempt
fails to obtain any rules even with a very small percentage of
corruptions in the data; (2) Many other approaches (e.g., [9]
[15]) are only concerned with identifying or removing noise
(corruptions) rather than fixing it; (3) Some of the prior
work (e.g., [10][7]) focuses on fixing only a single type of
error. This is inadequate on the web, where multiple
kinds of corruptions can occur.

We answer the web data cleaning problem by devising 
an end-to-end probabilistic framework on the available web 
data which involves learning a model of the clean data gen- 
eration process as well as an error model of the corrupt- 
ing process that introduces the noise. Then, by treating 
the clean value as a latent random variable, our framework 
leverages these two learned models and automatically infers 
its value through a Bayesian estimation. There are several 
advantages to this framework. First, modeling data prob- 
abilistically allows our framework to tolerate possible noise 
in the training data. In other words, it relaxes the limiting
requirement in the existing approaches (i.e., building clean 
corpus in advance for learning deterministic data quality 
rules). Second, explicitly and naturally modeling the noise 
and the data corruption process through an error model im- 
proves the accuracy and robustness of the noise identification 
and fixing. For example, our error model can consider a wide 
spectrum of errors that occur commonly on the web (e.g., 
misspelling, replacement and deletion errors). In contrast,
most state-of-the-art approaches must characterize each type
of error and develop a cleaning strategy for each separately.
This is especially inconvenient when a new type of error is 
found or the noisy data contains multiple types of errors. 

We evaluate the proposed framework rigorously on a real- 
world dataset (used auto sales data). The results demon- 
strate the effectiveness and efficiency of our method with 
respect to different sizes of the data and various levels of 
noise in the data. 

To summarize, our main contributions are as follows: 
1. We find that although CFD-based approaches are de-
signed to capture and fix dirty data, learning CFDs,
ironically, depends on the availability of a perfectly
or largely clean data corpus. Through an empirical
experiment, we show that a CFD learner fails to
discover any CFDs from a dataset which contains a
very small percentage of noise (0.1%). To the best of
our knowledge, this finding has not been reported
before.

2. We propose an end-to-end probabilistic framework for
cleaning dirty web data. Our approach involves
learning both the data generative model and the error
(corruption) model from the input dirty dataset. For a
possibly corrupted tuple, our framework leverages these
two learned models to automatically infer its clean
value.

3. We conduct extensive experiments to evaluate the
effectiveness and efficiency of our framework. The
experiments are performed on a real web dataset with
different types and levels of corruption introduced.

The rest of the paper is organized as follows. In Section 2,
we discuss related work. Section 3 presents the performance
of CFD-based approaches on real noisy web data. In Section
4, we present our approach. Quantitative evaluations are
described in Section 5. We conclude the paper in Section 6.

2. RELATED WORK 

Recent years have witnessed a significant research interest 
in data cleaning and enhancing data quality. A variety of 
approaches have been proposed with focus on noise elimi- 
nation, missing value prediction, and noisy value correction. 
Some of them work directly on detecting and removing data 
corruptions but without fixing them, such as outlier detec- 
tion [9] and noise removal [15]. On the other hand, some
focus on fixing those corruptions alone, such as value im- 
putation [t]. More recently, integrity constraints-based ap- 
proaches have been proposed to capture and fix data corrup- 
tions such that the resulting database D' is either consistent 
and minimally differs from original database D or certain er- 
rors in D get fixed. These methods heavily use the editing 
rules which are generated from the (conditional) functional 
dependencies (CFDs or FDs) , (conditional) inclusion depen- 
dencies (INDs or CINDs) or matching dependencies (MDs) 
found from the data [3| ^ |2] [m] . 

The focus of most of the above works is to improve the 
quality of the data from a closed domain (e.g., census data 
or enterprise data) with a single type of error (either in- 
completeness or inconsistency). Therefore, it is not clear 
whether applying them to the noisy data on the web will 
work. This is because of the openness of the web where 
many kinds of errors may co-exist. Furthermore, most in- 
tegrity constraints-based approaches require rules which are 
deterministic and carefully tuned. However, given the un- 
certainties on the web, the performance of these approaches 
cannot be guaranteed. More importantly, learning editing
rules requires a clean training corpus of high quality (an
implicit assumption made in most of the work in this line; see
the discussion in Section 3). However, such corpora are
infeasible to acquire on the web. To tackle these limitations, in
this paper, we propose an end-to-end probabilistic framework
which is designed to handle the data cleaning problem
for web data.

3. LIMITATIONS OF CFD-BASED METHODS FOR CLEANING WEB DATA

In this section, we present an understanding of how CFD-
based approaches work on real web data and show their in-
ability to clean the web data. As an example, Figure 1
shows the performance of a conditional functional depen-
dency (CFD) learner on the real auto sales data with respect
to different levels of noise (e.g., spelling errors, deletion er-
rors, or replacement errors), which are generated randomly.
The schema for this dataset is car(model, make, car-type, 
year, condition, drive-train, doors, engine) and the total num- 
ber of tuples was over 30,000. For the CFD learner, we 
directly used the one provided by the authors of 




Figure 1: Learning CFDs from dirty web data (x-axis: % of
data errors). The number of CFDs learned decreases, while the
learning time increases, as the percentage of data errors grows.

Based on the graph, we make one key observation about the
deficiency of CFDs: as the percentage of errors in the data
grows, the CFD learner finds dramatically fewer data editing
rules. Specifically, with only 1% errors in the data, the system
is unable to learn any rules. As a result, the erroneous data is
essentially unrepairable. We believe this is mainly due
to the following: (1) the presence of corrupted values vio-
lates possible patterns in the data, making them fragmented
and inconsistent; (2) finding CFDs is deterministic [4], i.e.,
the CFD learner cannot tolerate any errors in the
data patterns without some approximation.

We find that the CFD-based methods only work well if
the data is perfectly clean or largely clean (e.g., the CFD learner
found 70 or 31 rules when the data is 100% or 99.9% clean, as
shown in Figure 1). However, such an assumption is rather
unrealistic when we try to clean the real web data since 
the web is open and its noise rate would be possibly much 
higher than any controlled closed domain (e.g., enterprise 
database) which itself was reported to have an average 5% 
data errors [l^. Besides, CFD-based approaches are mostly 
used to make the data consistent (i.e., data patterns after 
cleaning tend to conform to these CFD rules). However, it is
not guaranteed that the errors fixed in this way are certain
errors. To obtain certain fixes, recent efforts suggest first
acquiring clean master data, learning CFDs from it, and applying
the learned rules to clean the data. Unfortunately, while such
clean corpora might be easy to establish in a closed domain, they
are hard to obtain on the web.

4. OUR APPROACH 

The observation mentioned above highlights the impor- 
tance of developing approaches that can really clean certain 
errors in dirty web data. In this section, we describe our 
model and the approach we propose to solving this prob- 
lem. 

4.1 Conceptual Model 

In this work, we view the data cleaning task as a statis-
tical inference problem. Let $\mathcal{D} = \{T_1, \ldots, T_n\}$ be the input
dataset. Each $T_i$ is a tuple with $m$ attributes $\{A_1, \ldots, A_m\}$, which
can be either clean or dirty, i.e., one or more of its attribute
values are corrupted. Let $\mathcal{T}^* = \{T^*_1, T^*_2, \ldots\}$ be a correction
candidate set for a tuple $T$. Then, in order to clean $T$, the
model must find the most likely $T^*$ in $\mathcal{T}^*$ (note that $T^*$ can
be the same as $T$ if $T$ is a clean tuple):

$$T^*_{\text{best}} = \arg\max_{T^* \in \mathcal{T}^*} \Pr[T^* \mid T] \qquad (1)$$

In practice, instead of directly optimizing Equation 1, we
can solve an equivalent problem by applying Bayes' rule and
dropping the constant denominator, so that we have:

$$T^*_{\text{best}} = \arg\max_{T^* \in \mathcal{T}^*} \Pr[T \mid T^*] \, \Pr[T^*] \qquad (2)$$
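The step from Equation 1 to Equation 2 can be written out as a short derivation. It relies only on Bayes' rule and on the fact that $\Pr[T]$ is fixed for a given observed tuple; here $\mathcal{T}^*$ denotes the candidate set of $T$:

```latex
\begin{align*}
\arg\max_{T^* \in \mathcal{T}^*} \Pr[T^* \mid T]
  &= \arg\max_{T^* \in \mathcal{T}^*} \frac{\Pr[T \mid T^*]\,\Pr[T^*]}{\Pr[T]} \\
  &= \arg\max_{T^* \in \mathcal{T}^*} \Pr[T \mid T^*]\,\Pr[T^*]
\end{align*}
```

The last line holds because dividing every candidate's score by the same positive constant $\Pr[T]$ does not change which candidate attains the maximum.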



So what are $\Pr[T \mid T^*]$ and $\Pr[T^*]$? To answer this, let us
first review how a tuple gets corrupted. We can view a tuple
$T$ as being generated by a two-stage process. First, in
the generation stage, a (noise-free) tuple $T^*$ is generated
according to an underlying "clean" probabilistic data model.
Then, this tuple gets corrupted. Which attribute(s) are cor-
rupted is determined by an underlying probabilistic "error"
model, and the "dirty" values for the corrupted attribute(s)
are generated from that error model according to some prob-
abilities. The actual representation stored in $\mathcal{D}$ (and seen) is
the tuple $T$. Therefore, $\Pr[T^*]$ can be viewed as the "clean"
data generative model and $\Pr[T \mid T^*]$ can be viewed as the
probabilistic "error" model. We summarize this error gener-
ation process in Figure 2.
Figure 2: Conceptual model of our approach

A benefit of our approach based on generative and error
models is its flexibility. For example, we can make the error
model accommodate a wide spectrum of possible errors (a
common limitation of current data cleaning approaches is
that they focus on a single type of error). Furthermore, one can
create an error model to account for either dependent or
independent corruptions.

Now, given the two models, our task is to estimate them.
We start by learning the generative model using Bayes
networks and then build the error model with a maximum
entropy model.

4.2 Data Generative Model

Calculating the data generative model $\Pr[T^*]$ requires
considering the dependencies between the attributes of a possible
clean tuple $T^*$. A Bayes network is a natural choice to
model and quantify these correlations.

Figure 3: The learned Bayes network structure of the
auto dataset

Learning the Bayes network usually involves two steps:
learning the topology of the Bayes network and learning its
conditional probability tables (CPTs). For the first step, we
use the Bayesian learning package Banjo [s] and run it over
the dataset $\mathcal{D}$. Note that although $\mathcal{D}$ may contain noisy
data, a Bayes network, unlike CFD approaches, naturally
models the data in a probabilistic way and thus can toler-
ate such noise. Once we have the structure of the Bayes
network $\mathcal{BN}$, we use the Infer.NET package [Tl] to learn the
parameters (i.e., the conditional probability tables). The Bayes
network thus learned represents $\Pr[T^*]$ in a factored form.
In particular, the probability of any specific true tuple $T^*$
can be read off as a joint probability entry from the Bayes
network. In Figure 3, we show a sample of the Bayes net-
work structure learned from the auto dataset.

4.3 Error Model

Next, we need to estimate the probabilistic error model
$\Pr[T \mid T^*]$. To simplify the learning of the error model, we
assume that each attribute is corrupted independently of
the other attributes. This allows us to learn the tuple error
model as a product of the attribute error models. Specifi-
cally, we have:

$$\Pr[T \mid T^*] = \prod_{A_i} \Pr[T_{A_i} \mid T^*_{A_i}] \qquad (3)$$

As mentioned earlier, the error distribution described by
our error model, $\Pr[T_{A_i} \mid T^*_{A_i}]$, is general enough to represent
any kind of error, as long as the distribution is known. In
this paper, we focus on three types of errors that we observed
to be the most commonly occurring in web data: spelling
errors, replacement errors, and deletion errors. We characterize
them with different strategies, which are summarized in
Equation 4:



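$$\Pr[T_{A_i} \mid T^*_{A_i}] = \begin{cases} \ldots & \text{if no error} \\ f_{ed}(T_{A_i}, T^*_{A_i}) & \text{if spelling error} \\ f_{ds}(T_{A_i}, T^*_{A_i}) & \text{otherwise} \end{cases} \qquad (4)$$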



The error distribution of a spelling error is based on the
edit-distance feature $f_{ed}$ (see its definition below). Other-
wise, our estimation is based on the distributional similarity
feature $f_{ds}$ (see its definition below). In such a case, the
error model $\Pr[T_{A_i} \mid T^*_{A_i}]$ can be regarded as the probability
that one attribute value $T^*_{A_i}$ is replaced by another value $T_{A_i}$.
Note also that we can view a deletion error as a special case
of the substitution error, i.e., the substituted value is empty
(NULL value).

Figure 4: The architecture of the end-to-end probabilistic web data cleaning system. Our framework requires
both a data generative model and an error model from the raw data. As mentioned in Section 4.2, learning the data
generative model is based on a two-stage process, as depicted in the dashed boxes.

Definition 4.1 (Edit-distance feature). This feature $f_{ed}$ is
defined based on the string edit distance between two input tu-
ple values. To present it in a probabilistic way, we use the
following definition:

$$f_{ed}(T_{A_i}, T^*_{A_i}) = \exp\{-ED(T_{A_i}, T^*_{A_i})\} \qquad (5)$$

where $ED(T_{A_i}, T^*_{A_i})$ is the number of edit operations required
to transform attribute value $T^*_{A_i}$ into $T_{A_i}$.
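As a minimal sketch of the edit-distance feature, the snippet below assumes that the edit distance is the standard Levenshtein distance (insertions, deletions, and substitutions, each costing 1). The text only speaks of "edit operations", so this choice and the function names `edit_distance` and `f_ed` are illustrative.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def f_ed(dirty: str, clean: str) -> float:
    """Edit-distance feature exp(-ED): 1.0 for identical values, decaying fast."""
    return math.exp(-edit_distance(dirty, clean))


print(f_ed("Acord", "Accord"), f_ed("Acord", "Civic"))
```

Because the feature decays exponentially, a candidate that is one edit away receives a far larger score than one that is several edits away, which is the behavior a spelling-error model needs.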

Definition 4.2 (Distributional similarity feature). This fea-
ture $f_{ds}$ is defined based on the probability of replacing one
value with another under a similar context. Formally, we
have:

$$f_{ds}(T_{A_i}, T^*_{A_i}) = \sum_{c \in C(T_{A_i}, T^*_{A_i})} \frac{\Pr[c \mid T^*_{A_i}] \, \Pr[c \mid T_{A_i}] \, \Pr[T_{A_i}]}{\Pr[c]} \qquad (6)$$

where $C(T_{A_i}, T^*_{A_i})$ is the context of a tuple attribute value,
which is a set of attribute values that co-occur with both
$T_{A_i}$ and $T^*_{A_i}$. $\Pr[c \mid T^*_{A_i}] = (\#(c, T^*_{A_i}) + \mu) / \#(T^*_{A_i})$ is the
probability that a context value $c$ appears given the clean at-
tribute value $T^*_{A_i}$ in the sample database. Similarly, $\Pr[T_{A_i}] =
\#(T_{A_i}) / \#\text{tuples}$ is the probability that a dirty attribute
value appears in the sample database. We calculate $\Pr[c \mid T_{A_i}]$
and $\Pr[c]$ in the same way. To avoid zero estimates for
attribute values that do not appear in the database sample,
we use the Laplace smoothing factor $\mu$.

The following example illustrates how distribution simi- 
larity between features is computed. 

Example 4.1 Consider a tuple $t$: (Focus, Honda, JPN, Mid-
size, V6) from group 4 ($g_4$) in Table 1, where the frequencies
are based on the occurrences of certain attribute values, e.g.,
there are 100 tuples (which together form a group) with
Model=Accord ∧ Make=Honda ∧ Size=Full-size ∧ Engine=V6. Based on
common knowledge, the value Focus might be dirty (Focus is a
well-known Ford car). There
are two possible candidates for the correct value: Accord
from $g_1$ or $g_2$, and Civic from $g_3$. To determine which is
the right one, we calculate the distributional similarity features
$f_{ds}$(Accord, Focus) and $f_{ds}$(Civic, Focus).

First, we need to get the context $C$(Accord, Focus). Note
that, since there are two groups of Accord cars with different
engines, their distributional similarity to Focus
in $t$ also differs. Nevertheless, let $S_1$ be the set of all
the attribute values in the tuples that contain Accord from
$g_1$. We have $S_1$ = {Honda, JPN, Full-size, V6}. Similarly, we
have $S_2$ = {Honda, JPN, Full-size, V6}, where $S_2$ is the set of
co-occurring attribute values of tuples that contain Focus in
$g_4$ (since $t$ is from $g_4$). Let the context $C$(Accord, Focus) =
$S_1 \cap S_2$ = {Honda, JPN, Full-size, V6}. Applying Equation 6,
we can get $f_{ds}$(Accord, Focus) = 0.179 conditioned on
$g_1$ and $g_4$. Analogously, we can also get $C$(Civic, Focus) =
{Honda, JPN} and $f_{ds}$(Civic, Focus) = 0.082. As a result,
Accord is the right candidate for the dirty value Focus.



Table 1: Sample database

GID  Model   Make   Orig  CarType    Engine  Freq.
g1   Accord  Honda  JPN   Full-size  V6      100
g2   Accord  Honda  JPN   Full-size  V4      150
g3   Civic   Honda  JPN   Mid-size   V4      100
g4   Focus   Honda  JPN   Full-size  V5      15
g5   Focus   Ford   USA   Compact    V4      105
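The distributional similarity computation can be sketched on the rows of Table 1. Several things here are assumptions rather than the paper's method: the feature is taken to have the form sum over c of Pr[c | clean] * Pr[c | dirty] * Pr[dirty] / Pr[c]; the smoothing factor is set to 1; counts are taken over the whole table instead of being conditioned on specific groups; and the context is restricted to the other values of the dirty tuple (here the g4 row as listed in Table 1). The sketch is therefore not expected to reproduce the values 0.179 and 0.082 from Example 4.1, only the ranking of the two candidates.

```python
from collections import Counter

# Table 1 as (Model, Make, Orig, CarType, Engine) rows with their frequencies.
ROWS = [
    (("Accord", "Honda", "JPN", "Full-size", "V6"), 100),
    (("Accord", "Honda", "JPN", "Full-size", "V4"), 150),
    (("Civic", "Honda", "JPN", "Mid-size", "V4"), 100),
    (("Focus", "Honda", "JPN", "Full-size", "V5"), 15),
    (("Focus", "Ford", "USA", "Compact", "V4"), 105),
]
MU = 1.0  # Laplace smoothing factor; the value is an assumption

N_TUPLES = sum(freq for _, freq in ROWS)
VALUE_COUNT = Counter()  # #(v): number of tuples containing value v
PAIR_COUNT = Counter()   # #(c, v): number of tuples containing both c and v
for values, freq in ROWS:
    for v in values:
        VALUE_COUNT[v] += freq
        for c in values:
            if c != v:
                PAIR_COUNT[(c, v)] += freq


def pr_given(c, v):
    """Smoothed Pr[c | v] = (#(c, v) + mu) / #(v)."""
    return (PAIR_COUNT[(c, v)] + MU) / VALUE_COUNT[v]


def f_ds(dirty, clean, dirty_tuple):
    """Assumed form: sum_c Pr[c|clean] * Pr[c|dirty] * Pr[dirty] / Pr[c]."""
    # Context: other values of the dirty tuple that co-occur with both values.
    context = [c for c in dirty_tuple
               if c != dirty and PAIR_COUNT[(c, dirty)] and PAIR_COUNT[(c, clean)]]
    pr_dirty = VALUE_COUNT[dirty] / N_TUPLES
    return sum(pr_given(c, clean) * pr_given(c, dirty) * pr_dirty
               / (VALUE_COUNT[c] / N_TUPLES) for c in context)


t = ("Focus", "Honda", "JPN", "Full-size", "V5")  # group g4 as listed in Table 1
scores = {cand: f_ds("Focus", cand, t) for cand in ("Accord", "Civic")}
print(scores)
```

Under these assumptions Accord shares three context values with the dirty tuple (Honda, JPN, Full-size) while Civic shares only two, so Accord receives the larger score, in line with the conclusion of Example 4.1.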



In practice, we do not know beforehand which kind of er-
ror has occurred for a particular attribute. In other words, it
is impossible to determine that definitively without knowing the
clean version, $T^*_{A_i}$. Furthermore, it is also rather unrealistic
to have a single definite error strategy for a given attribute of
a tuple. In fact, we want a unified error model which can ac-
commodate all three types of errors (and be flexible enough
to accommodate more errors when necessary). For this pur-
pose, we use the well-known maximum entropy framework 
ni to leverage all available features, including string edit 
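distance-based feature $f_{ed}$ and distributional-based feature
$f_{ds}$. So for each attribute $A_i$, we have our unified error
model defined on this attribute as follows:

$$\Pr[T_{A_i} \mid T^*_{A_i}] = \frac{1}{Z} \exp\{\alpha f_{ed}(T_{A_i}, T^*_{A_i}) + \beta f_{ds}(T_{A_i}, T^*_{A_i})\} \qquad (7)$$

where $\alpha$ and $\beta$ are the weights of each feature, and $Z =
\sum_{T^*} \exp\{\sum_i \lambda_i f_i(T, T^*)\}$ is a normalization factor, with
$\lambda_i$ and $f_i$ ranging over the feature weights and features. To com-
pute the entire error model for tuples $T$ and $T^*$, we just plug
Equation 7 into Equation 3.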
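A minimal sketch of the unified error model follows. It assumes that the feature values for each candidate clean value are already available as `(f_ed, f_ds)` pairs and that the normalization runs over the supplied candidates; the function names and the feature values in the usage lines are hypothetical, while the default weights 2.3 and 3.5 are the ones reported in Section 5.2.

```python
import math


def maxent_error_model(features, alpha=2.3, beta=3.5):
    """Map each candidate clean value to Pr[dirty value | candidate].

    `features` maps a candidate to its (f_ed, f_ds) pair for one dirty value.
    """
    weights = {cand: math.exp(alpha * fed + beta * fds)
               for cand, (fed, fds) in features.items()}
    z = sum(weights.values())  # normalization factor Z over the candidates
    return {cand: w / z for cand, w in weights.items()}


def tuple_error_model(attribute_probs):
    """Tuple-level error model: product of the per-attribute probabilities."""
    return math.prod(attribute_probs)


# Illustrative feature values only; they are not taken from the paper.
model_probs = maxent_error_model({"Accord": (0.0, 0.15), "Civic": (0.0, 0.09)})
make_probs = maxent_error_model({"Honda": (1.0, 0.20), "Ford": (0.0, 0.05)})
print(tuple_error_model([model_probs["Accord"], make_probs["Honda"]]))
```

Setting `alpha=0.0` or `beta=0.0` switches off one feature, which corresponds to the single-feature variants compared in the experiments.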

4.4 Putting the Pieces Together 

We now describe the working of the system depicted in
Figure 4. Our system runs as a standalone application on
an offline database which may contain corruptions.
We first tokenize the entire data and apply the Banjo
package to learn the structure of the Bayes network for it.
We then provide the learned structure together with the
entire database to an inference engine (Infer.NET in our
paper) for learning the CPTs. By completing this stage, we
have a generative model of the data. In parallel, we define
and learn an error model which incorporates three types of
errors (cf. Section 4.3). Now we can begin cleaning the
database tuple by tuple. For each tuple $T$ in the database,
we first find a set of its clean candidates $\mathcal{T}^* = \{T^*_1, T^*_2, \ldots\}$
by looking at all the tuples in the database that are within
a certain edit distance of $T$. Then, for each $\langle T, T^* \rangle$
pair, we compute the $\Pr[T^* \mid T]$ value
using Equation 2, which itself is based on the two learned
models mentioned above.

Figure 5: The percentage of corrupted values
cleaned by the algorithms (using both features, only the
edit-distance feature, and only the distributional sim-
ilarity feature) as a function of the noise in the
database.

Last, we pick the one which maximizes $\Pr[T^* \mid T]$, deem
it the best $T^*$, and store it as the clean copy of the tuple.
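The cleaning loop described above can be sketched end to end on a toy database. This is a simplified stand-in, not the system itself: the prior Pr[T*] is approximated by the empirical tuple frequency instead of a learned Bayes network, the likelihood Pr[T | T*] uses only a per-attribute exp(-edit distance) term, and the candidate threshold `max_dist=3` is an arbitrary choice.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute; unit cost)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def tuple_distance(t1, t2):
    """Sum of attribute-wise edit distances between two tuples."""
    return sum(edit_distance(x, y) for x, y in zip(t1, t2))


def clean_database(database, max_dist=3):
    counts = Counter(database)
    total = len(database)

    def prior(t):  # stand-in for the generative model Pr[T*]
        return counts[t] / total

    def likelihood(dirty, clean):  # stand-in for the error model Pr[T | T*]
        return math.prod(math.exp(-edit_distance(x, y))
                         for x, y in zip(dirty, clean))

    cleaned = []
    for dirty in database:
        # Candidates: distinct tuples within the edit-distance threshold
        # (the tuple itself is always included, at distance 0).
        candidates = [t for t in counts if tuple_distance(dirty, t) <= max_dist]
        cleaned.append(max(candidates,
                           key=lambda c: likelihood(dirty, c) * prior(c)))
    return cleaned


db = [("Accord", "Honda")] * 5 + [("Civic", "Honda")] * 3 + [("Acord", "Honda")]
print(clean_database(db)[-1])
```

In this toy run the rare tuple containing "Acord" is one edit away from a frequent tuple, so the product of likelihood and prior favors the frequent tuple, while the frequent tuples are kept unchanged.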

5. EXPERIMENTS 

In this section, we quantitatively study the performance
of our proposed approach on a large real dataset: used car
sales data. We present two sets of experiments evaluating
the approach in terms of (1) effectiveness and (2)
efficiency.

5.1 Experimental Setup 

To perform the experiments, we obtained real data
from the web. The dataset is the used car sales dataset
$D_{car}$, which contains around 10,000 tuples crawled from Google
Base. The schema of this dataset that we used in our exper-
iments was car(model, make, car-type, year, condition, drive-
train, doors, engine). We manually inspected the data to
make sure it was clean and deemed the dataset "clean". We
then introduced three types of noise to attributes in $D_{car}$.
To add noise to an attribute, we randomly changed it either
to a new value which is close in terms of string edit distance
(distance between 1 and 4, simulating spelling errors), or to
a new value which was from the same attribute (simulating
replacement errors), or just deleted it (simulating deletion er-
rors). The resulting "dirty" dataset is referred to as $D'_{car}$. We used
a parameter $r$ ranging from 0.1% to 5% for the noise rate.
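The noise injection procedure can be sketched as follows. The sketch assumes that the noise rate is applied independently to every attribute value and that the three error types are chosen uniformly at random; neither detail is stated in the text, and the helper names `misspell` and `inject_noise` are hypothetical.

```python
import random
import string


def misspell(value, rng, max_edits=4):
    """Apply between 1 and max_edits random character edits to a string."""
    chars = list(value)
    for _ in range(rng.randint(1, max_edits)):
        op = rng.choice(("insert", "delete", "substitute")) if chars else "insert"
        letter = rng.choice(string.ascii_lowercase)
        if op == "insert":
            chars.insert(rng.randrange(len(chars) + 1), letter)
        elif op == "delete":
            del chars[rng.randrange(len(chars))]
        else:
            chars[rng.randrange(len(chars))] = letter
    result = "".join(chars)
    # If the edits happened to cancel out, force a one-character change.
    return result if result != value else value + letter


def inject_noise(rows, rate, rng):
    """Corrupt each attribute value with probability `rate`."""
    columns = list(zip(*rows))
    dirty = []
    for row in rows:
        new_row = []
        for j, value in enumerate(row):
            if rng.random() >= rate:
                new_row.append(value)  # keep the clean value
                continue
            kind = rng.choice(("spelling", "replacement", "deletion"))
            if kind == "spelling":
                new_row.append(misspell(value, rng))
            elif kind == "replacement":
                others = sorted(set(columns[j]) - {value})
                new_row.append(rng.choice(others) if others else value)
            else:
                new_row.append(None)  # deletion: the value becomes NULL
        dirty.append(tuple(new_row))
    return dirty


rows = [("Accord", "Honda"), ("Civic", "Honda"), ("Focus", "Ford")]
print(inject_noise(rows, 0.05, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the corruption reproducible, which makes it possible to compare the cleaned output against the original rows as ground truth.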

5.2 Effectiveness 

We now show the effectiveness of our algorithm in clean-
ing the noisy data in $D'_{car}$, and demonstrate how the pa-
rameters may be varied to obtain the desired results. In
Figure 5, we show the resilience of the algorithm to noise
in the input database $D'_{car}$. The weight ($\alpha$) for the string
edit distance feature $f_{ed}$ was fixed at 2.3, while the weight
($\beta$) for the distributional similarity feature $f_{ds}$ was fixed at
3.5. These values were chosen based on the results from Fig-
ure 5, which we explain in the next paragraph. In addition,
to evaluate the effectiveness of the maximum entropy model
that we adopted, we compare its cleaning performance with
that of cleaning algorithms that use a single type of feature
at a time (in other words, we set $\alpha = 0$ or $\beta = 0$ in
Equation 7, respectively, and obtain the corresponding re-
sults). As we can see from this figure, all algorithms achieve
a substantial reduction in the noise of the data. Specifically,
at 1% noise in the data, our algorithm, which leverages all
features, corrects more than 31% of the errors in the data, whereas
CFD-based methods failed to find even a single CFD (cf.
Figure 1) and are thus not able to fix any data corruptions.
The number of false positives for each of these cases was less
than or equal to 11 tuples (which is a very small percentage
of the corpus size). Besides, it is clear that using the max-
imum entropy model to combine all features achieves
better results than using them alone.

Figure 6: The number of values corrected by the al-
gorithm, the number of erroneous values introduced
by the algorithm, and the overall increase in the
number of clean values generated. The x-axis shows
the value of the parameter.

Setting $\beta$: Recall that in our approach we have two weights
that can be adjusted: the weight given to the distributional
similarity ($\beta$), and the weight given to the edit distance ($\alpha$).
The ratio of these two weights depends on which kind of
error is more likely to occur. We found that setting the edit
distance weight to $0.667 \times \beta$ yields the best results. Keeping
this ratio fixed, in Figure 6 we show how the algorithm per-
forms as $\beta$ is changed. The "values corrected" data points in
the graph correspond to the number of attribute values that
were erroneous in the input data that the algorithm success-
fully corrected (when checked against the ground truth).

The "false positives" are the number of legitimate values
that the algorithm changed to an erroneous value. When
cleaning the data, our algorithm chooses a candidate tuple
based on both the prior of the candidate as well as the like-
lihood of the correction given the evidence. Low values of
the parameter $\beta$ give a higher weight to the prior than the
likelihood. In other words, a lower value of the parameter
indicates a higher likelihood of changing the tuple. As a re-
sult, some legitimate tuples are "corrected" to a tuple that
has a much larger prior. As $\beta$ is increased, the number of
such false positives decreases. However, this also reduces the
number of values corrected, because some kinds of unlikely
errors no longer justify the higher cost of correction.

The "overall gain" in the number of clean values is calcu- 
lated as the difference of clean values between the output 
and input of the algorithm. In this particular experiment, 
there were 357 errors in the input data, of which the best 
correction was obtained at a parameter value of 3.0, where 
the overall gain was 87 clean values. 

5.3 Efficiency 

In Figure 7a and Figure 7b, we show the time taken by the
algorithm (using maximum entropy). These graphs include
both the time taken to learn the generative model and
the time taken to clean every tuple of the database. As can
be seen, the algorithm completes in a reasonable amount of
time, even with 10,000 tuples in the database.
Figure 7: Time taken by the algorithm to clean the
database. In (a), the data noise is fixed at 0.5%; in
(b), the number of tuples is fixed at 5k.

We show the effect of noise on the time taken by the al-
gorithm in Figure 7b. For this curve, the number of tuples
was kept constant at 5,000 tuples. As can be seen from the 
figure, the time taken by the algorithm increases as the per- 
centage of noise in the data increases. This is because for 
every tuple that we have to clean, we have a much larger set 
of candidate T*s to consider. Adding noise to the dataset 
effectively increases the number of different tuples within 
the edit distance threshold of the data, thus a much larger 
number of error model comparisons need to be made. 

6. CONCLUSION 

In this paper, we focused on approaches for cleaning in- 
consistent web data. We showed that the current state of the 
art approaches, which learn and use conditional functional 
dependencies (CFDs) to rectify data, do not work well with 
web data as they demand clean master data for training. 
We proposed a fully probabilistic framework for cleaning 
data that involved learning both the generative and error 
(corruption) models of the data and using them to clean the 
data. For generative models, we learn Bayes networks from 
the data. For error models, we consider a maximum entropy 
framework for combining multiple error processes. The gen-
erative and error models are learned directly from the noisy 
data. Preliminary experimental results on web data showed 
that our probabilistic approach is able to reduce errors in 
the data long after CFD-based methods fail to be effective. 